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Abstract 

Background: Multiple imputation is becoming increasingly popular for handling missing data. However, it is often 
implemented without adequate consideration of whether it offers any advantage over complete case analysis for 
the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation 
model, particularly as the amount of missing data increases. 

Methods: Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information 
recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various 
proportions of data (10-90%) were set missing at random in a highly-skewed continuous covariate or in the binary 
exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness 
log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of 
regression parameter estimates were compared between multiple imputation and complete case analyses. 

Results: For missingness in the continuous covariate, multiple imputation produced less bias and greater precision 
for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision 
with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage 
were apparent in estimating the continuous covariate's effect when skewness was not adequately addressed. 
For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple 
imputation were minimal, particularly for the coefficient of the binary exposure. 

Conclusions: Although multiple imputation can be useful if covariates required for confounding adjustment are 
missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, 
when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not 
present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with 
missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. 
Further work is needed to provide clearer guidelines on effective application of this method. 

Keywords: Missing data, Multiple imputation, Fully conditional specification, Multivariate normal imputation, 
Non-normal data 



Introduction 

Statistical analysis of epidemiological data is often hindered 
by missing data. Multiple imputation is a two-stage process 
whereby missing values are imputed multiple times from a 
statistical model based on the available data and used in 
analyses that combine results across the multiply imputed 
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datasets [1,2]. Such an approach is used increasingly to deal 
with missing data [3,4], but it is often carried out without 
carefully considering the extent of likely gain over a 
complete case analysis nor whether there is the potential to 
introduce bias from a poorly fitting imputation model. 

A common misconception with multiple imputation 
arises from focussing on the mechanics of filling in miss- 
ing values, as if imputation recovers a fully observed sam- 
ple, when in fact the value of multiple imputation (if any) 
relates to whether it recovers information about (popula- 
tion) parameters of interest. Information recovery may be 
in the form of reduced bias or increased precision, and a 
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challenge for the analyst is to assess whether using mul- 
tiple imputation for a given problem presents sufficient 
potential for recovery of information to be worthwhile. 

Once it has been concluded that multiple imputation 
may be of value, the question becomes whether the avail- 
able multiple imputation technology will provide a valid 
approach. There are currently two readily available meth- 
ods for generating imputed datasets. Multivariate normal 
imputation (MVNI) uses a Markov Chain Monte Carlo al- 
gorithm to obtain imputed values assuming a multivariate 
normal distribution for all variables subject to missingness 
[2]. An alternative is fully conditional specification (FCS) 
where separate regression models are fitted for each vari- 
able with missingness, conditional on other variables in 
the imputation model [5,6]. Both approaches assume 
values are "missing at random" (MAR), i.e. the missing- 
ness is dependent on observed values only, and rely on 
parametric assumptions, in particular that continuous 
variables are normally distributed (at least conditionally 
under FCS). If data truly are MAR then multiple imput- 
ation using an appropriate imputation model is a valid ap- 
proach (asymptotically unbiased with correct standard 
errors and coverage [1]), so the accuracy of results is 
determined by the validity of the assumptions made in the 
imputation model [7]. We note that this paper leaves aside 
the important issue of potential sensitivity of results to the 
MAR assumption itself [8,9]. 

Both MVNI and FCS assume specific parametric models 
that do not fit real data perfectly. Our recent paper demon- 
strated the inadequacy of both of these multiple imputation 
approaches in estimating regression coefficients in a stand- 
ard regression analysis when there was missingness in a 
highly skewed covariate when the non-normality was not 
taken into account in the imputation model [10]. The fail- 
ure of model assumptions is likely to be more damaging as 
the amount of missingness increases, since more data will 
be generated from an ill-fitting model [7]. 

The aim of this paper is to explore the recovery of infor- 
mation from multiple imputation, and how this is affected 
by the fraction of observations with missing values. We re- 
port the results of a simulation study in which we generate 
missingness assuming various forms of MAR, so that mul- 
tiple imputation would be valid if performed under a cor- 
rect model, and compare inferences for regression 
parameters under various missing data scenarios between 
multiple imputation and complete case analysis. In our pre- 
vious paper we demonstrated that MVNI and FCS pro- 
duced similar results [10] and hence for this analysis we 
focus on MVNI. In interpreting the results from the simula- 
tion study we focus on what gains can be made by using 
multiple imputation compared to complete case analysis for 
estimating the effect of a binary exposure, and how the po- 
tential gains are affected by which variables contain missing 
values. As a secondary aim, we also explore how potential 



benefits of imputation are affected by the fit of the imput- 
ation model, extending the results of our previous paper 
[10] to explore the effect of ignoring skewness while imput- 
ing in the presence of increasing fractions of missing data. 

Analysis 

Creating the simulated datasets 

As in previous work [10], we use data from a synthetic 
"population" of 971,327 girls created to resemble a real 
epidemiological study [11]. The analysis included six 
variables representing data collected at the time of re- 
cruitment (Wave I): race (black, non-black Hispanic and 
other), school grade (ordinal: years 7 [aged 12-13] to 11 
[aged 16-17]), self reported health (ordinal: 1 = excellent, 

2 = very good, 3 = good, 4 = fair, 5 = poor) and fitness (or- 
dinal - you are physically fit: 1 = strongly agree, . . ., 
5 = strongly disagree), emotional distress (continuous 0- 

3 with higher scores representing higher distress), and 
the primary exposure of interest, a binary indicator for 
whether the girl had dieted in the previous 7 days or 
not. The outcome was the emotional distress score mea- 
sured at a second wave of follow-up (Wave II) one year 
later (again continuous 0-3). Datasets were created by 
drawing random samples of 1000 observations from the 
synthetic population (with replacement between sam- 
ples) and we focus on the estimation of a regression 
model for emotional distress at Wave II: 

E(ldistW2) = a + ^diet + ji 2 ldistWl 

+ fi^mcei + fi^racei + fi^grade 

+ fljiealth + [3-jfitness (1) 

where diet represents the dieting indicator, race; and race 2 
are indicators for being black and non-black Hispanic re- 
spectively, and grade, health and fitness represent the 
three categorical variables. Emotional distress (at Waves I 
and II) was highly positively skewed and therefore was 
analysed on the log e scale (denoted IdistWl and ldistW2 
respectively). As in previous work, we artificially inflated 
the diet effect so that it was borderline statistically signifi- 
cant with the chosen sample size [10]. 

In a second set of analyses we categorised the outcome 
distress at Wave II into depressed and not depressed (by 
dichotomising at an arbitrary threshold of 1), and fo- 
cussed on estimating the parameters of the multiple lo- 
gistic regression: 

distW2i ~ binomial(l,p) 

logit(p) = (f) + r/^iet + n 2 ldistWl + f] 3 racei 

+r/ 4 race2 + t] 5 grade + rj 6 health+ rj 7 fitness 

(2) 

where distW2i is the indicator for being distressed at 
Wave II (=1 if distW2>\ and 0 otherwise) and other 
variables are as previously described. 
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Inducing missingness 

We considered the simple situation where data were 
MAR in a single variable in two scenarios, where there 
was induced missingness in (i) the baseline distress 
measure, a strong predictor of outcome, and (ii) the diet- 
ing indicator, the exposure of interest. 

In both settings, values were set to missing with a 
probability determined by a logistic regression model. 
Missingness in distress at Wave I was determined by: 

logit Vr(distW\ missing) 

= y + S\diet + 8irace\ + Syacei + degrade 

+ S s ldistW2 + 0 x IdistWl (3) 

and in diet by: 

logit Pr (diet missing) = 0 + 0 x diet 

+ 8jrace\ + 8?,race2 

+ degrade + SsldistWl 

+ S 6 ldistWl (4) 

In each case S 2 = log(2.7), S 3 = log(2.7), S 4 = log(1.2), 
and S s = log(1.3), with Sj = log(2.7) in (3) and S 6 = log 
(1.3) in (4). y an d </> were adjusted to obtain 10%, 25%, 
50%, 75% and 90% missingness in Equations (3) and (4) 
respectively. 

In the analyses of the binary depression outcome, 
missingness in distress at Wave I was induced using the 
model: 

logit Pr(distWl missing) 

= <p + S\diet + 5irace\ + 8yacei + degrade 

+ S 7 distW2i (5) 

where Sj, 6 2 , S 3 , S 4 were as above, S 7 = log(2.7) and <J> 
was adjusted to control the amount of missingness. 

Analysis of the simulated data 

For each simulated dataset, with missingness imposed 
according to Equation (3) or (4), we estimated the re- 
gression model of interest (Equation 1) using MVNI and 
a complete case analysis, and similarly for the second set 
of analyses with missingness imposed according to 
Equation (5) and the analysis model given in Equation 
(2). All analyses were performed in Stata version 11 [12], 
with MVNI carried out using "mi impute mvn" with a 
uniform prior distribution. 

MVNI was performed including all covariates from the 
analysis model (Equation 1) and the outcome in the im- 
putation model, to ensure the maximum recovery of in- 
formation. Distress at Waves I and II, and the health 
and fitness scores were transformed to improve normal- 
ity using either a simple log transformation or a shifted 
log transformation ("log-skewO") u = ln( ±x - k), choosing 
k and the sign of x so that u has zero skewness as used 
previously [10]. Imputed values of distress from the log- 



skewO transformation were truncated at the low end at 
the smallest observed value in the sample, with high 
values from both methods truncated at the scale max- 
imum of 3. Grade was included as a linear predictor in 
the imputation model. 

Comparison of methods 

The properties of the regression coefficient estimates 
from each analysis were assessed by comparing the 
results from the 1000 simulated datasets (each of 1000 
observations) to the "true" population parameters from 
the synthetic population of 971,327. In each multiple 
imputation analysis, 20 imputed datasets were created, 
with inferences for the coefficients obtained by com- 
bining results over the imputed datasets using Rubin's 
rules [1]. 

We report the bias, the average (estimated) standard 
error (SE), the standardised bias (calculated as the bias 
divided by the average SE), the mean squared error 
(MSE), and the coverage of the estimated 95% confi- 
dence interval (CI) compared to the population para- 
meters, as described previously [10]. Based on the 
simulation sample size of 1000, the Monte Carlo (MC) 
error of the estimates across the repeated samples can 
be calculated as S£/vT000, and the estimated coverage 
should lie in the range 93.6% to 96.4% (with 95% 
probability), if the true coverage is equal to the nom- 
inal 95%. There is also MC error due to the finite 
number of imputations, but this additional uncertainty 
was small compared with the magnitude of the esti- 
mates [13]. Findings were similar for each of the 
health and fitness covariates, so only the results for 
the health effect are reported. 

Results from the simulation study 

Table 1 displays pairwise Spearman rank correlations be- 
tween the continuous and ordinal variables in the ana- 
lysis model for the complete synthetic population, to 
provide a simple description of bivariate associations be- 
tween these variables (notwithstanding the limitations of 
correlations for categorical variables). In the population, 
girls who dieted tended to be more distressed at Wave I, 
be in a slightly higher grade, and have higher self- 
reported health and fitness compared to non-dieters, al- 
though these relationships were fairly weak. There were 
slightly larger correlations between health and fitness 
and distress at Wave I (correlations both 0.22). In rela- 
tion to outcome, there were similar levels of distress at 
Wave II in dieters and non-dieters (correlation = 
-0.0008) and a moderate correlation between distress at 
Wave I and outcome (correlation = 0.56). For a fuller de- 
scription of this dataset see [10]. 
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Table 1 Spearman rank correlations between covariates 
and distress at Wave II in the synthetic population 
(n = 971,327) 





Diet 


Log 


Grade Health Fitness 






(distress at Wave 1) 




Log(Distress Wl) 


0.07 






Grade 


0.06 


0.10 




Health 


0.07 


0.22 


0.03 


Fitness 


0.13 


0.22 


0.09 0.42 


Outcomes 


LogPstress Wll) 


-0.0008 


0.56 


0.07 0.20 0.20 



Scenario 1: Continuous outcome - missing data on 
distress at wave I 

When missing data were introduced in distress at Wave I 
(Table 2) and 90% of data were complete, there was slight 
bias in the exposure of interest, the dieting coefficient, in 
the complete case analysis. This bias was eliminated using 
multiple imputation irrespective of the transformation 
used to address the non-normality. There was little bias in 
the estimates for the other parameters for all analyses. For 
all parameters, precision was improved using multiple im- 
putation, although gains in precision were small with this 
low rate of missingness (SE for the diet coefficient 8% lar- 
ger with complete case analysis). 

When 75% of data were observed, both imputation 
models led to smaller bias and improved precision in the 
dieting coefficient compared with complete case analysis 
(SE approximately 24% larger) corresponding to a smaller 
MSE (Figure 1), although even complete case analysis had 



a reasonably small bias (standardised bias = -0.031/ 
0.082 = -0.38) [14]. There were also gains in estimation of 
the health coefficient, a variable with some correlation 
with distress at Wave I, although these gains were less 
pronounced than in the dieting coefficient, which was only 
weakly correlated with distress at Wave I. In contrast, 
there was larger bias in the coefficient for distress at Wave 
I (which contained missing data) when a (simple) log 
transformation was used in the imputation model, com- 
pared with complete case analysis (standardised bias = 
-0.86, compared to -0.22), problems which were amelio- 
rated when the log-skewO transformation was used for 
imputation. It is difficult to interpret SEs in the presence 
of bias, but for the least biased method, using the log- 
skewO transformation, there were much smaller gains in 
precision from multiple imputation in the distress coeffi- 
cient (SE approximately 5% larger with complete case ana- 
lysis) than with the other parameters. A similar pattern 
was observed when 50% of data were complete. 

When only 25% of data were complete, there was 
slightly less bias and substantially greater precision in the 
dieting coefficient under multiple imputation compared 
with complete case analysis (SE approximately 2 times 
larger with complete case analysis), reflected in a much 
smaller MSE (Figure 1). In contrast, there was much lar- 
ger bias (bias = -0.106, MC error = 0.002) and consider- 
able under-coverage of the 95% CIs for the distress 
coefficient from multiple imputation using a log trans- 
formation compared with complete case analysis. Al- 
though the bias from multiple imputation was improved 
using the log-skewO transformation, the coverage from 



Table 2 Performance of methods for regression of (continuous) distress at Wave II, with missing baseline distress 



% complete data 


Method 




Diet (/3 


, = -0.101) 




Distress (j3 2 = 0.554) 




Health (j3 6 = 0.042) 






Bias 


SE 


StdBias 


Coverage 


Bias 


SE 


StdBias 


Coverage 


Bias 


SE 


StdBias 


Coverage 


90% 


CCA 


-0.016 


0.069 


-0.229 


93.9%) 


<0.001 


0.036 


0.014 


95.1%) 


0.002 


0.032 


0.063 


93.7% 




MVNI-log 


-0.001 


0.064 


-0.012 


93.0% 


-0.009 


0.035 


-0.252 


95.1%> 


0.003 


0.031 


0.095 


94.1% 




MVNI-skewO 


-0.001 


0.064 


-0.010 


93.2% 


<0.001 


0.035 


0.001 


95.4%o 


0.002 


0.031 


0.061 


94.2% 


75%o 


CCA 


-0.031 


0.082 


-0.383 


92.4% 


-0.009 


0.040 


-0.219 


95.4%) 


0.001 


0.036 


0.016 


95.3% 




MVNI-log 


0.004 


0.067 


0.059 


95.1%) 


-0.032 


0.037 


-0.863 


84.7%o 


0.003 


0.032 


0.110 


94.7% 




MVNI-skewO 


0.003 


0.066 


0.047 


94.5%) 


-0.009 


0.038 


-0.233 


96.0%) 


0.001 


0.031 


0.033 


94.5% 


50% 


CCA 


-0.054 


0.118 


-0.458 


91.9%) 


-0.01 1 


0.051 


-0.210 


95.5%) 


0.003 


0.048 


0.064 


94.9% 




MVNI-log 


0.002 


0.075 


0.031 


94.6% 


-0.064 


0.045 


-1.426 


68.7%) 


0.008 


0.034 


0.246 


94.5% 




MVNI-skewO 


<0.001 


0.074 


-0.003 


95.2%) 


-0.012 


0.046 


-0.263 


94.7%) 


0.002 


0.033 


0.050 


94.8% 


25% 


CCA 


-0.058 


0.190 


-0.308 


91.8%> 


-0.020 


0.075 


-0.265 


93.7%) 


0.002 


0.071 


0.031 


94.8% 




MVNI-log 


0.004 


0.091 


0.041 


95.4%o 


-0.106 


0.060 


-1.775 


57.4% 


0.013 


0.039 


0.328 


93.9% 




MVNI-skewO 


0.004 


0.092 


0.045 


94.6%) 


-0.023 


0.059 


-0.397 


90.7%) 


0.003 


0.039 


0.072 


94.4% 


10%> 


CCA 


-0.023 


0.345 


-0.068 


94.8% 


-0.025 


0.125 


-0.196 


95.9%) 


0.008 


0.120 


0.071 


96.0% 




MVNI-log 


0.011 


0.129 


0.084 


96.0% 


-0.145 


0.092 


-1.576 


60.9%) 


0.021 


0.049 


0.423 


91.3% 




MVNI-skewO 


0.013 


0.136 


0.096 


95.1% 


-0.056 


0.091 


-0.614 


89.5%o 


0.009 


0.052 


0.183 


94.8% 



Measures of performance are mean values from the estimation of the /3 parameters in Equation 1 across the 1000 simulated datasets of 1000 observations 
(compared to the true values from the synthetic population of 971,327). CCA = Complete Case Analysis: MVNI = Multivariate normal imputation: StdBias = 
standardised bias; SE = average (estimated) standard error across the 1000 datasets. 
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Figure 1 Mean Squared Errors from regression of (continuous) distress at Wave II with missing baseline distress, a) Diet, b) Emotional 
Distress at Wave I Results presented are the average Mean Squared Error across the 1000 simulated datasets in the parameter estimates 
from linear regression of (continuous) emotional distress at Wave II from Equation 1, with missing data on emotional distress at baseline. 
CCA = Complete Case Analysis; MVNI = Multivariate normal imputation. 



this analysis was still below the nominal 95% with this 
large amount of missing data. There was also large bias 
(0.013, MC error = 0.001) and under-coverage of CIs for 
the health coefficient, a covariate correlated with the 



variable subject to missingness, when the simple log 
transformation was used in the imputation model. How- 
ever, the bias and coverage were similar for the health co- 
efficient from multiple imputation and complete case 
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analysis when the log-skewO transformation was used, 
with 82% gain in precision and a lower MSE under mul- 
tiple imputation. 

A similar pattern was observed when only 10% of data 
were complete, with even larger gains in bias and preci- 
sion in the dieting coefficient (e.g. SE 2.5 times larger 
with complete case analysis compared with the MVNI- 
skewO method), but large bias and reduced coverage for 
the distress coefficient from both imputation analyses 
compared to complete case analysis. Again this affected 
the estimation of the health coefficient under multiple 
imputation using the log-transformation but not using 
the log-skewO transformation, where there was similar 
bias to the complete case analysis but a 2.3-fold reduc- 
tion in the SE. 

Scenario 2: Continuous outcome - missing data on diet 

When missingness was imposed in the exposure of 
interest, the dieting indicator, bias and coverage for esti- 
mation of its effect were fairly similar under complete 
case and multiple imputation irrespective of the amount 
of missing data (Table 3, Figure 2; note that only mul- 
tiple imputation results using the log-skewO transform- 
ation are presented because of its superior performance 
in the previous scenario). There were relatively modest 
gains in precision and MSE from multiple imputation 
compared with complete case analysis for the dieting co- 
efficient even when a large proportion of observations 
were incomplete (SE 68% larger with complete case ana- 
lysis when only 10% of data were available). Reduced 
bias, improved coverage and a smaller MSE were 
obtained under multiple imputation for all coefficients 
aside from diet, with larger gains in precision as the 
amount of missing data increased, for example gains of 
over 3-fold were obtained for all coefficients (except 



diet) compared to complete case analysis when only 10% 
of data were available. 

Scenario 3: Binary outcome - missing data on distress at 
wave I 

With the dichotomised outcome and missingness in dis- 
tress at Wave I, there was reduced bias in the dieting co- 
efficient, and improved precision and a reduced MSE for 
the diet and health coefficients under multiple imput- 
ation, compared to complete case analysis, when at least 
50% of data were fully observed (Table 4). However, 
there was large bias (-0.043, MC error = 0.007) in the 
distress coefficient from multiple imputation even when 
as much as 75% of data were observed. When only 25% 
of data were observed, both complete case analyses and 
multiple imputation produced gross under-coverage of 
the 95% CI for all parameters, despite a similar pattern 
of reduced bias in the diet coefficient and smaller SEs in 
the diet and health coefficients under multiple imput- 
ation compared with complete case analysis. It was not 
possible to obtain results with only 10% of data complete 
due to a large number of datasets with zero counts in 
the cross-tabulation of diet and distress at Wave II. 

Summary of results and discussion 

In this study we explored the value of multiple imput- 
ation in a simulated regression analysis and examined 
how this varied according to the pattern of missing data. 
We found that although multiple imputation recovered 
important information about associations concerning 
fully observed variables when there were large amounts 
of missing data in a covariate, much smaller gains in bias 
and precision were seen in the coefficient for the variable 
with missing values, irrespective of whether the outcome 
was continuous or binary. The important difference is 



Table 3 Performance of methods for regression of (continuous) distress at Wave II with missing dieting indicator 



% complete data 


Method 




Diet (0 


, = -0.101) 




Distress (ft = 0.554) 




Health (ft = 0.042) 






Bias 


SE 


StdBias 


Coverage 


Bias 


SE 


StdBias 


Coverage 


Bias 


SE 


StdBias 


Coverage 


90% 


CCA 


0.001 


0.067 


0.019 


94.2% 


-0.006 


0.036 


-0.154 


94.9% 


-0.001 


0.032 


-0.020 


95.1% 




MVNI-skewO 


0.005 


0.065 


0.074 


94.6% 


0.001 


0.034 


0.020 


94.7% 


-0.001 


0.030 


-0.043 


95.3% 


75% 


CCA 


-0.001 


0.075 


-0.014 


95.0% 


-0.014 


0.040 


-0.360 


94.9% 


<0.001 


0.036 


0.009 


95.5% 




MVNI-skewO 


0.008 


0.070 


0.119 


96.1% 


<0.001 


0.034 


0.013 


95.2% 


-0.001 


0.030 


-0.042 


95.6% 


50%> 


CCA 


-0.006 


0.097 


-0.065 


94.2% 


-0.028 


0.049 


-0.577 


91.8% 


0.004 


0.047 


0.086 


94.6% 




MVNI-skewO 


0.014 


0.080 


0.175 


95.6% 


0.001 


0.034 


0.032 


94.6% 


<0.001 


0.030 


-0.002 


96.3% 


25%) 


CCA 


-0.01 1 


0.147 


-0.076 


94.6% 


-0.043 


0.072 


-0.603 


91.9% 


-0.001 


0.071 


-0.014 


94.3% 




MVNI-skewO 


0.018 


0.104 


0.172 


96.1% 


<0.001 


0.034 


-0.009 


93.6% 


-0.002 


0.030 


-0.071 


94.2% 


10% 


CCA 


-0.020 


0.250 


-0.081 


95.8% 


-0.053 


0.118 


-0.452 


92.2% 


0.009 


0.120 


0.077 


94.7% 




MVNI-skewO 


0.011 


0.149 


0.071 


95.0% 


-0.001 


0.035 


-0.042 


95.9% 


<0.001 


0.031 


-0.014 


95.4% 



Measures of performance are mean values in the estimation of the /3 parameters from Equation 1 across the 1000 simulated datasets of 1000 observations 
(compared to the true values from the synthetic population of 971,327). CCA = Complete Case Analysis; MVNI = Multivariate normal imputation; StdBias = 
standardised bias; SE = average (estimated) standard error across the 1000 datasets. 
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(a) 




o 



10 25 50 75 90 

Percentage with complete data 

- CCA ■ MVNI-skewO 



(b) 




o 



10 25 50 75 90 

Percentage with complete data 

— • CCA ■ MVNI-skewO 

Figure 2 Mean Squared Errors from regression of (continuous) distress at Wave II with missing dieting indicator, a) Diet, b) Emotional 
Distress at Wave I Results presented are the average Mean Squared Error across the 1000 simulated datasets in the parameter estimates from 
linear regression of (continuous) emotional distress at Wave II from Equation 1, with missing data on the dieting indicator. CCA = Complete Case 
Analysis; MVNI = Multivariate normal imputation. 
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Table 4 Performance of methods for regression of (dichotomous) distress at Wave II with missing baseline distress 



% with complete data 


Method 




Diet (Hi =0.754) 




Distress {r\ 2 = 6.260) 




Health (r| 6 = 1-132) 






Bias 


cc 
bb 


Mutsias 


Coverage 


Bias 


cc 
bb 


btubias 


Coverage 


Bias 


cc 
bb 


btdbias 


Coverage 


nr\Q/-, 
yUyo 


CC A 


n 1 1 c 


0.257 


-0.449 


92.6% 


U.Uz 1 


0.197 


0.106 


95.0% 


U.UU/ 


0.115 


0.065 


95.0% 




MVNI-skewO 


0.002 


0.226 


0.008 


93.9% 


-0.003 


0.199 


-0.013 


95.3% 


0.005 


0.106 


0.048 


95.1% 


75% 


CCA 


-0.209 


0.319 


-0.653 


91 .9% 


0.013 


0.227 


0.059 


93.70% 


0.006 


0.133 


0.047 


95.4% 




MVNI-skewO 


-0.007 


0.233 


-0.030 


95.8% 


-0.043 


0.230 


-0.187 


94.6% 


0.004 


0.108 


0.039 


94.3% 


50% 


CCA 


-0.243 


0.485 


-0.501 


96.3% 


0.046 


0.306 


0.149 


94.5% 


0.008 


0.180 


0.043 


94.0% 




MVNI-skewO 


0.014 


0.251 


0.057 


95.9% 


-0.082 


0.304 


-0.270 


94.6% 


0.008 


0.115 


0.066 


95.6% 


25% 


CCA 


-0.120 


0.833 


-0.144 


82.4% 


0.091 


0.487 


0.188 


79.4% 


0.021 


0.286 


0.073 


80.0% 




MVNI-skewO 


0.021 


0.308 


0.068 


80.6% 


-0.155 


0.447 


-0.348 


78.5% 


0.002 


0.132 


0.012 


80.5% 



Measures of performance are mean values in the estimation of the /3 parameters from Equation 1 across the 1000 simulated datasets of 1000 observations 
(compared to the true values from the synthetic population of 971,327). CCA = Complete Case Analysis; MVNI = Multivariate normal imputation; StdBias = 
standardised bias; SE = average (estimated) standard error across the 1000 datasets. 

Note in this example it was not possible to include an analysis where only 10% of the data were complete due to a large number of datasets with zero counts in 
the cross-tabulation of diet and distress at Wave II. 



that when there is missing data in confounding variables 
only, we lose potentially available information about the 
relationship between the fully observed variables (includ- 
ing the variable of interest) and the outcome, which can 
be recovered by imputation. In contrast, cases for which 
the variable of interest is missing hold little information 
about the relationship between that variable and the out- 
come, so little information is gained on the coefficient 
for the variable of interest when we impute missing 
values. 

Similar findings of variable benefits of multiple imput- 
ation were observed by White and Carlin [15] who sug- 
gested that when associations are weak, the gain in 
precision for a regression coefficient, calculated as: 

^\SEmi} ^ ^ 

where SE CC and SE MI are the SEs from the complete 
case and multiple imputation analysis respectively, might 
be approximated by the fraction of incomplete cases 
(observations) among those with observed values of the 
independent variable of interest ("FICO"). This concept 
is consistent with our finding that gains in information 
arise primarily when cases with observed values for the 
exposure of interest are recovered by imputing missing 
values in other covariates. In our dataset we had strong 
predictors of missingness and we saw large gains in pre- 
cision compared to the FICO, which perhaps provides a 
lower bound for potential gains. 

Potential information recovery from multiple imput- 
ation needs to be weighed up against the possibility of 
bias being introduced by a poorly fitting imputation 
model. A second aim of this study was to extend our 
previous work which focussed on detailed comparisons 
between the MVNI and FCS imputation methods [10], 
to explore the extent to which the performance of 



multiple imputation was affected by a poorly fitting 
model as the amount of missing data increased. We 
found that although multiple imputation was fairly 
robust to non-normality when there were few missing 
values, there was large bias and poor coverage in the es- 
timation of the regression coefficient for the highly 
skewed continuous covariate when the variable was sub- 
ject to substantial missingness, with similar results for 
the continuous and binary outcome settings. In both set- 
tings, there was also some contamination in the estima- 
tion of coefficients of other variables, particularly those 
that were correlated with the variable subject to missing- 
ness, if the skewness was not removed. Relating back to 
our previous paper [10], it may be worth noting that we 
obtained an essentially identical pattern of results for all 
of the scenarios examined when multiple imputation 
was carried out using the FCS method (fitted using the 
ice command in Stata again using a log and log-skewO 
transformation to address non-normality). 

The results presented in this paper extend our previ- 
ous findings that bias can be introduced if skewness is 
not adequately addressed, highlighting the increasing 
unreliability of multiple imputation methods if skewness 
is not removed as the proportion of incomplete observa- 
tions increases and a larger fraction of data is imputed 
from a mis-specified model [7]. In particular, when the 
majority of observations had missing data, multiple im- 
putation under the normal model became unreliable 
even if the skewness was removed. A similar finding of 
larger bias when there was a larger proportion of miss- 
ing data was seen by Demirtas et al. [16] who explored 
the performance of MVNI in a simulation study with in- 
complete data in various skewed and multimodal vari- 
ables. However, in contrast to our findings, their overall 
conclusion was that MVNI performed reasonably well 
even when the normality assumption was clearly vio- 
lated, particularly when there was a large sample size. 



Lee and Carlin Emerging Themes in Epidemiology 2012, 9:3 
http://www.ete-online.eom/content/9/1/3 



Page 9 of 10 



It has been shown that multiple imputation has greater 
value if variables used in the imputation model are pre- 
dictive of missingness and of the missing values [14]. 
However, the results from the current study suggest that 
multiple imputation can introduce bias in the coeffi- 
cients for variables that are correlated with the covariate 
with missing data when skewness was not removed dur- 
ing imputation, particularly when there are lots of miss- 
ing values, as seen with the health and fitness 
coefficients in the first example. There was less of an im- 
pact in estimating the dieting effect since the diet indica- 
tor was not associated with distress at Wave I. 

Researchers often ask how much missing data can be 
imputed. Marshall et al. [17] reported multiple imput- 
ation to be useful when up to 50% of observations had 
data MAR in one or more variables in their simulations. 
Barzi et al. [18] found inflated variability and conver- 
gence problems with multiple imputation when more 
than 60% of observations had missing data, with varying 
results, depending on the method of multiple imput- 
ation, when 10-60% of observations had missing data. As 
we have demonstrated here, whether multiple imput- 
ation may offer substantial benefit over complete case 
analysis depends on whether missing data is in the ex- 
posure of interest or in covariates, and also on the 
strength of inter-relationships between variables. These 
factors, along with the predictors of missingness and the 
missing data pattern, vary considerably across settings 
making it impossible to specify general rules for when 
multiple imputation will be beneficial (aside from the 
dangers of an ill-specified model). The relative merits of 
multiple imputation and complete case analysis were 
discussed by White and Carlin [15] who showed that 
when data were missing in just one or two covariates 
neither multiple imputation nor complete case was uni- 
versally better than the other, depending on the missing 
data mechanism, although they did find multiple imput- 
ation was superior in a wider range of the settings exam- 
ined. While it does not seem possible to set rules for 
when multiple imputation will be beneficial it is import- 
ant to identify situations where multiple imputation is 
unlikely to be helpful. 

Limitations of this simulation study 

Caution is needed in generalising from the results of a 
single simulation study. In particular these results are 
affected by the inherent structure of the synthetic popu- 
lation used. However, this example clearly illustrates po- 
tential dangers of using multiple imputation when there 
are large amounts of missingness and/or highly skewed 
data. We have considered the simple situation of data 
MAR in a single variable, but in practice there is likely 
to be missingness in a number of variables, with com- 
plex inter-relationships between variables and their 



missingness patterns, which may have a range of conse- 
quences. Finally the analysis presented in this paper 
missingness was highly dependent on a few (known) 
variables, relationships which are likely to be weaker in 
practice, so reducing the potential for information recov- 
ery. Further exploration in more complex scenarios, for 
example with more variables subject to missingness and 
different patterns of missing data, would be beneficial. 

Conclusions 

The results from this study demonstrate that although it 
may be important to use multiple imputation to recover 
information when there are missing data in covariates 
required for adjustment, multiple imputation has sub- 
stantially less value when there are missing data (even 
when MAR) in the exposure of interest. This study also 
highlights the potential for poor results from standard 
approaches to multiple imputation particularly when a 
large fraction of individuals have missing data. 

These findings have important implications, particu- 
larly for large epidemiological studies where there may 
be varying degrees of missingness across a number of 
variables. Firstly, these results demonstrate that when 
there is a lot of missing data, large-scale imputation may 
be futile, and in particular it can introduce more bias 
than a complete case analysis if the imputation model 
does not fit the data well. For this reason when using 
multiple imputation we would recommend carrying out 
a complete case analysis in parallel, as suggested by 
White and Carlin [15] - this provides reassurance if in- 
ference from the two are similar, but may highlight 
issues with one or both approaches if results differ sub- 
stantially. Secondly these results suggest that it is im- 
portant to develop tailored imputation models to 
address specific analysis questions. This recommenda- 
tion clashes with the idea that multiple imputation could 
be carried out by an "expert imputer" to create a set of 
imputed datasets that could be used by data analysts to 
answer a range of research questions [1,19]. 

Our overall conclusion is that the current wave of en- 
thusiasm for multiple imputation should be tempered 
with greater caution about its limitations. Further work 
is needed to elucidate more detailed guidelines for its ef- 
fective application. 
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